Abstract

For a robot to be capable of development, it must be able to explore
its environment and learn from its experiences. It must find (or
create) opportunities to experience the unfamiliar in ways that reveal
properties valid beyond the immediate context. In this paper, we
develop a novel method for using the rhythm of everyday actions as a
basis for identifying the characteristic appearance and sounds
associated with objects, people, and the robot itself. Our approach is
to identify and segment groups of signals in individual modalities
(sight, hearing, and proprioception) based on their rhythmic variation,
then to identify and bind causally-related groups of signals across
different modalities. By including proprioception as a modality, this
cross-modal binding method applies to the robot itself, and we report a
series of experiments in which the robot learns about the
characteristics of its own body.

1.  Introduction 

To  robots  and  young  infants,  the  world  is  a  puzzling 
place,  a  confusion  of  sights  and  sounds.  But  buried 
in  the  noise  there  are  hints  of  regularity.  Some  of 
this  is  natural;  for  example,  objects  tend  to  go  thud 
when  they  fall  over  and  hit  the  ground.  Some  is  due 
to  the  child;  for  example,  if  it  shakes  its  limbs  in 
joy  or  distress,  and  one  of  them  happens  to  pass  in 
front  of  its  face,  it  will  see  a  fleshy  blob  moving  in  a 
familiar  rhythm.  And  some  of  the  regularity  is  due  to 
the  efforts  of  a  caregiver;  consider  an  infant’s  mother 
trying  to  help  her  child  learn  and  develop,  perhaps 
by  tapping  a  toy  or  a  part  of  the  child’s  body  (such 
as  its  hand)  while  speaking  its  name,  or  making  a 
toy’s  characteristic  sound  (such  as  the  bang-bang  of 
a  hammer). 

In this paper we seek to extract useful information from repeated
actions performed either by a caregiver or the robot itself.
Observation of infants shows that such actions happen frequently, and
from a computational perspective they are ideal learning material since
they are easy to identify and offer a wealth of redundancy (important
for robustness). The information we seek from repeated actions is the
characteristic appearance and sound of the object, person, or robot
involved, with context-dependent information such as the visual
background or unrelated sounds stripped away. This allows the robot to
generalize its experience beyond its immediate context and, for
example, later recognize the same object used in a different way.

We wish our system to be scalable, so that it can correlate and
integrate multiple sensor modalities (currently sight, sound, and
proprioception). To that end, we detect and cluster periodic signals
within their individual modalities, and only then look for cross-modal
relationships between such signals. This avoids a combinatorial
explosion of comparisons, and means our system can be gracefully
extended to deal with new sensor modalities in the future (touch,
smell, etc.).

This paper begins by introducing our robotic platform and what it can
sense. We then introduce the methods we use for detecting regularity in
individual modalities and the tests applied to determine when to ‘bind’
features in different modalities together. The remainder (and larger
part) of the paper presents experiments where the robot detects
regularity in objects, people it encounters, and finally itself.

2.  Platform  and  percepts 

This work is implemented on the humanoid robot Cog (Brooks et al.,
1999). Cog has an active vision head, two six-degree-of-freedom arms, a
rotating torso, and a microphone array arranged along its shoulders.
For this paper, we work with visual input from one of Cog’s four
cameras, acoustic input from the microphone array, and proprioceptive
feedback from joints in the head, torso, and arms.

Figure 1: A summary of the possible perceptual states of our robot -
the representation shown here will be used throughout the paper. Events
in any one of the three modalities (sight, proprioception, or hearing)
are indicated as in block 1. When two events occur in different
modalities, they may be independent (top of 2) or bound (bottom of 2).
When events occur in three modalities, the possibilities are as shown
in 3.

Figure 1 shows how the robot’s perceptual state can be summarized - the
icons shown here will be used throughout the paper. The robot can
detect periodic events in any of the individual modalities (sight,
hearing, proprioception). Any two events that occur in different
modalities will be compared, and may be grouped together if there is
evidence that they are causally related or bound. Such relations are
transitive: if events A and B are bound to each other, and B and C are
bound to each other, then A and C will also be bound. This is important
for consistent, unified perception of events.
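
As an illustration of the transitivity just described, the following is a minimal sketch (not the implementation used on Cog) that keeps bound events in disjoint sets. The class name `BindingGroups` and the string labels used for events are assumptions made for this example only.

```python
class BindingGroups:
    """Union-find over event labels; events that are bound share a root."""

    def __init__(self):
        self._parent = {}

    def _find(self, event):
        # Unknown events start out as their own singleton group.
        self._parent.setdefault(event, event)
        while self._parent[event] != event:
            # Path halving keeps the trees shallow.
            self._parent[event] = self._parent[self._parent[event]]
            event = self._parent[event]
        return event

    def bind(self, a, b):
        """Record that events a and b were judged causally related."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_bound(self, a, b):
        return self._find(a) == self._find(b)


groups = BindingGroups()
groups.bind("sight", "hearing")            # A bound to B
groups.bind("hearing", "proprioception")   # B bound to C
print(groups.are_bound("sight", "proprioception"))  # prints True
```

With this representation, binding A to B and B to C places all three events in one group, so A and C are reported as bound without ever being compared directly.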

This kind of summarization ignores cases in which there are, for
example, multiple visible objects moving periodically and making
different sounds. We return to this point later in the paper. We have
previously demonstrated that our system can deal well with
multiple-binding cases, since it performs segmentation in the
individual modalities (Arsenio and Fitzpatrick, 2003). For this paper,
there is no real need to consider such cases, since we don’t expect the
robot’s caregiver to maliciously introduce distractors into its
environment - but nevertheless it is an important feature of our
algorithm, which we now present.


Figure 2: When watching a person using a hammer, the robot detects and
groups points moving in the image with similar periodicity (Arsenio et
al., 2003) to find the overall trajectory of the hammer and separate it
out from the background. The detected trajectory is shown on the left
(for clarity, just the coordinate in the direction of maximum variation
is plotted), and the detected object boundary is overlaid on the image
on the right.


3.  Detecting  periodic  events 

We are interested in detecting conditions that repeat with some roughly
constant rate, where that rate is consistent with what a human can
easily produce and perceive. This is not a very well defined range, but
we will consider anything above 10 Hz to be too fast, and anything
below 0.1 Hz to be too slow. Repetitive signals in this range are
considered to be events in our system. For example, waving a flag is an
event, clapping is an event, walking is an event, but the vibration of
a violin string is not an event (too fast), and neither is the daily
rise and fall of the sun (too slow). Such a restriction is related to
the idea of natural kinds (Hendriks-Jansen, 1996), where perception is
based on the physical dimensions and practical interests of the
observer.

To find periodicity in signals, the most obvious approach is to use
some version of the Fourier transform. And indeed our experience is
that use of the Short-Time Fourier Transform (STFT) demonstrates good
performance when applied to the visual trajectory of periodically
moving objects (Arsenio et al., 2003). For example, Figure 2 shows a
hammer segmented visually by tracking and grouping periodically moving
points. However, our experience also leads us to believe that this
approach is not ideal for detecting periodicity of acoustic signals. Of
course, acoustic signals have a rich structure around and above the kHz
range, for which the Fourier transform and related transforms are very
useful. But detecting gross repetition around the single Hz range is
very different. The sound generated by a moving object can be quite
complicated, since any constraints due to inertia or continuity are
much weaker than for the physical trajectory of a mass moving through
space. In our experiments, we find that acoustic signals may vary
considerably in amplitude between repetitions, and that there is
significant variability or drift in the length of the periods. These
two properties combine to reduce the efficacy of Fourier analysis. This
led us to the development of a more robust method for periodicity
detection, which is now described. In the following discussion, the
term signal refers to some sensor reading or derived measurement, as
described at the end of this section. The term period is used strictly
to describe event-scale repetition (in the Hz range), as opposed to
acoustic-scale oscillation (in the kHz range).

Figure 3: Extraction of an acoustic pattern from a periodic sound (a
hammer banging). The algorithm for signal segmentation is applied to
each normalized frequency band. The box on the right shows one complete
segmented period of the signal. Time and frequency axes are labeled
with single and double arrows respectively.

Figure 4: Results of an experiment in which the robot could see a car
and a cube, and both objects were moving - the car was being pushed
back and forth on a table, while the cube was being shaken (it has a
rattle inside). By comparing periodicity information, the high-pitched
rattle sound and the low-pitched vroom sound were distinguished and
bound to the appropriate object, as shown on the spectrogram. The
object segmentations shown were automatically determined.
[Figure 4 axis label: time (msecs)]

Period estimation - For every sample of the signal, we determine how
long it takes for the signal to return to the same value from the same
direction (increasing or decreasing), if it ever does. For this
comparison, signal values are quantized adaptively into discrete
ranges. Intervals are computed in one pass using a look-up table that,
as we scan through the signal, stores the time of the last occurrence
of a value/direction pair. The next step is to find the most common
interval using a histogram (which requires quantization of interval
values), giving us an initial estimate $p_{estimate}$ for the event
period. This is essentially the approach presented in (Arsenio and
Fitzpatrick, 2003). For the work presented in this paper, we extended
this method to explicitly take into account the possibility of drift
and variability in the period, as follows.

Clustering - The previous procedure gives us an estimate $p_{estimate}$
of the event period. We now cluster samples in rising and falling
intervals of the signal, using that estimate to limit the width of our
clusters but not to constrain the distance between clusters. This is a
good match with real signals we see that are generated from human
action, where the periodicity is rarely very precise. Clustering is
performed individually for each of the quantized ranges and directions
(increasing or decreasing), and then combined afterwards. Starting from
the first signal sample not assigned to a cluster, our algorithm runs
iteratively until all samples are assigned, creating new clusters as
necessary. A signal sample extracted at time $t$ is assigned to a
cluster with center $c_i$ if $\| c_i - t \|_2 < p_{estimate}/2$. The
cluster center is the average time coordinate of the samples assigned
to it, weighted according to their values.

Merging - Clusters from different quantized ranges and directions are
merged into a single cluster if $\| c_i - c_j \|_2 < p_{estimate}/2$,
where $c_i$ and $c_j$ are the cluster centers.

Segmentation - We find the average interval between neighboring cluster
centers for positive and negative derivatives, and break the signal
into discrete periods based on these centers. Notice that we do not
rely on an assumption of a constant period for segmenting the signal
into repeating units. The average interval is the final estimate of the
signal period.
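
The period estimation and clustering steps can be illustrated with a short sketch. This is a simplified reading of the description rather than the original implementation: it assumes uniform (not adaptive) quantization, records an event only when the quantized signal enters a new range, works in units of samples, and uses an unweighted mean for the cluster center. All function names are hypothetical.

```python
import math
from collections import Counter


def level_crossings(signal, levels=8):
    """Return (sample_index, level, direction) each time the quantized
    signal enters a new range; direction is +1 (rising) or -1 (falling)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return []
    step = (hi - lo) / levels
    quantized = [min(int((v - lo) / step), levels - 1) for v in signal]
    events = []
    for i in range(1, len(quantized)):
        if quantized[i] != quantized[i - 1]:
            direction = 1 if quantized[i] > quantized[i - 1] else -1
            events.append((i, quantized[i], direction))
    return events


def estimate_period(events):
    """Most common interval (in samples) between successive occurrences
    of the same (level, direction) pair, found in a single pass."""
    last_seen = {}         # look-up table: (level, direction) -> sample index
    histogram = Counter()  # interval length -> number of occurrences
    for index, level, direction in events:
        key = (level, direction)
        if key in last_seen:
            histogram[index - last_seen[key]] += 1
        last_seen[key] = index
    if not histogram:
        return None
    return histogram.most_common(1)[0][0]


def cluster_centers(times, p_estimate):
    """Greedy 1-D clustering: a time joins the current cluster while it
    lies within p_estimate / 2 of that cluster's (unweighted) center."""
    centers, members = [], []
    for t in sorted(times):
        if members and abs(sum(members) / len(members) - t) < p_estimate / 2:
            members.append(t)
        else:
            if members:
                centers.append(sum(members) / len(members))
            members = [t]
    if members:
        centers.append(sum(members) / len(members))
    return centers


def refined_period(centers):
    """Final estimate: average interval between neighboring centers."""
    gaps = [b - a for a, b in zip(centers, centers[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Synthetic test signal: five cycles, 100 samples per cycle.
signal = [-math.cos(2 * math.pi * (i + 0.5) / 100) for i in range(500)]
events = level_crossings(signal)
p_estimate = estimate_period(events)
rising = [i for i, _, d in events if d == 1]
centers = cluster_centers(rising, p_estimate)
print(p_estimate, refined_period(centers))
```

Because the cluster width is bounded by the estimate but the spacing between clusters is not, the refined period is free to follow drift that a single histogram peak would hide.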

The output of this entire process is an estimate of the period of the
signal, a segmentation of the signal into repeating units, and a
confidence value that reflects how periodic the signal really is. The
period estimation process is applied at multiple temporal scales. If a
strong periodicity is not found at the default time scale, the time
window is split in two and the procedure is repeated for each half.
This constitutes a flexible compromise between the time-based and
frequency-based views of a signal: a particular movement might not
appear periodic when viewed over a long time interval, but may appear
as such at a finer scale.
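
The window-splitting strategy can be sketched as a short recursion. The example below is illustrative only: `analyse` stands for any routine returning a (period, confidence) pair, and `toy_analyse`, the confidence threshold, and the minimum window length are assumptions introduced for the example.

```python
def find_periodic_windows(signal, analyse, min_confidence=0.5,
                          min_length=8, start=0):
    """Return (start, end, period) for each window judged periodic.
    If a window is not convincingly periodic, split it in two and retry."""
    if len(signal) < min_length:
        return []
    period, confidence = analyse(signal)
    if period is not None and confidence >= min_confidence:
        return [(start, start + len(signal), period)]
    half = len(signal) // 2
    left = find_periodic_windows(signal[:half], analyse,
                                 min_confidence, min_length, start)
    right = find_periodic_windows(signal[half:], analyse,
                                  min_confidence, min_length, start + half)
    return left + right


def toy_analyse(window):
    """Toy stand-in for a real detector: full confidence if the window
    repeats exactly with some lag, zero confidence otherwise."""
    for lag in range(2, len(window) // 2 + 1):
        if all(window[i] == window[i + lag]
               for i in range(len(window) - lag)):
            return lag, 1.0
    return None, 0.0


# First half repeats every 4 samples; second half never repeats.
signal = [0, 1, 2, 3] * 8 + list(range(100, 132))
windows = find_periodic_windows(signal, toy_analyse)
print(windows)
```

Here the full window is rejected, but its first half is accepted on its own, which is the behaviour the splitting rule is meant to provide.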

Figure 5: Here the robot is shown a tambourine in use. The robot
detects that there is a periodically moving visual source, and a
periodic sound source, and that the two sources are causally related
and should be bound. All images in these figures are taken directly
from recordings of real-time interactions, except for the summary box
in the top-left (included since in some cases the recordings are of
poor quality). The images on the far right show the visual
segmentations recorded for the tambourine in the visual modality. The
background behind the tambourine, a light wall with doors and windows,
is correctly removed. Acoustic segmentations are generated but not
shown (see Figures 3 and 4 for examples).

Figure 2 shows an example of using periodicity to visually segment a
hammer as a human demonstrates the periodic task of hammering, while
Figure 3 shows segmentation of the sound of the hammer in the
time-domain. Segmentation in the frequency-domain was demonstrated in
(Arsenio and Fitzpatrick, 2003) and is illustrated in Figure 4. For
these examples and all other experiments described in this paper, our
system tracks moving pixels in a sequence of images from one of the
robot’s cameras using a multiple tracking algorithm based on a
pyramidal implementation of the Lucas-Kanade algorithm. A microphone
array samples the sounds around the robot at 16 kHz. The Fourier
transform of this signal is taken with a window size of 512 samples and
a repetition rate of 31.25 Hz. The Fourier coefficients are grouped
into a set of frequency bands for the purpose of further analysis,
along with the overall energy.
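
The acoustic front-end figures quoted above are mutually consistent, as the short calculation below shows: 16000 samples per second divided by 512 samples per window gives 31.25 windows per second, which suggests consecutive, non-overlapping windows. The band-grouping function is a hedged sketch only; the number of bands (8 here) and their equal widths are assumptions, since the text does not specify them.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_SAMPLES = 512

# One transform per non-overlapping window of 512 samples.
frame_rate_hz = SAMPLE_RATE_HZ / WINDOW_SAMPLES
# Frequency spacing between adjacent Fourier coefficients.
bin_width_hz = SAMPLE_RATE_HZ / WINDOW_SAMPLES
# Highest frequency representable at this sampling rate.
nyquist_hz = SAMPLE_RATE_HZ / 2


def band_energies(power_spectrum, n_bands=8):
    """Sum per-coefficient power into n_bands contiguous bands of equal
    width, and also return the overall energy."""
    n = len(power_spectrum)
    edges = [round(k * n / n_bands) for k in range(n_bands + 1)]
    bands = [sum(power_spectrum[a:b]) for a, b in zip(edges, edges[1:])]
    return bands, sum(power_spectrum)


print(frame_rate_hz, bin_width_hz, nyquist_hz)
bands, total = band_energies([1.0] * 256)
print(bands, total)
```

Each band then yields one value per window, giving a 31.25 Hz signal per band that is slow enough to be searched for event-scale periodicity.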

4.  Learning  about  objects 

Segmented features extracted from visual and acoustic segmentations
(using the method presented in the last section) can serve as the basis
for an object recognition system. In the visual domain, (Fitzpatrick,
2003) used segmentations derived through physical contact as an
opportunity for a robot to become familiar with the appearance of
objects in its environment and grow to recognize them. (Krotkov et al.,
1996) have looked at recognition of the sound generated by a single
contact event. Visual and acoustic cues are both individually important
for recognizing objects, and can complement each other when, for
example, the robot hears an object that is outside its view, or it sees
an object at rest. But when both visual and acoustic cues are present,
then we can do even better by looking at the relationship between the
visual motion of an object and the sound it generates. Is there a loud
bang at an extreme of the physical trajectory? If so we might be
looking at a hammer. Are the bangs at either extreme of the trajectory?
Perhaps it is a bell. Such relational features can only be defined and
factored into recognition if we can relate or bind visual and acoustic
signals.

Several theoretical arguments support the idea of binding by temporal
oscillatory signal correlations (von der Malsburg, 1995). From a
practical perspective, repetitive synchronized events are ideal for
learning since they provide large quantities of redundant data across
multiple sensor modalities. In addition, as already mentioned, extra
information is available in periodic or locally-periodic signals such
as the period of the signal, and the phase relationship between signals
from different senses - so for recognition purposes the whole is
greater than the sum of its parts.

Therefore, a binding algorithm was developed to associate cross-modal,
locally periodic signals, by which we mean signals that have locally
consistent periodicity, but may experience global drift and variation
in that rhythm over time. In our system, the detection of periodic
cross-modal signals over an interval of seconds using the method
described in the previous section is a necessary (but not sufficient)
condition for a binding between these signals to take place. We now
describe the extra constraints that must be met for binding to occur.

Figure 6: In this experiment, the robot sees people shaking their head.
In the top row, the person says “no, no, no” in time with his
head-shake. The middle row shows the recorded state of the robot during
this event - it binds the visually tracked face with the sound spoken.
The lower row shows the state during a control experiment, when a
person is just nodding and not saying anything. Recorded segmentations
for these experiments are shown on the right.

For concreteness assume that we are comparing a visual and acoustic
signal. Signals are compared by matching the cluster centers determined
as in the previous section. Each peak within a cluster from the visual
signal is associated to a temporally close (within a maximum distance
of half a visual period) peak from the acoustic signal, so that the
sound peak has a positive phase lag relative to the visual peak.
Binding occurs if the acoustic period matches the visual one, or if it
matches half the visual period, within a tolerance of 60 ms. The reason
for the second match is that often sound is generated at the fastest
points of an object’s trajectory, or the extremes of a trajectory, both
of which occur twice for every single period of the trajectory.
Typically there will be several redundant matches that lead to binding
within a window of the sensor data for which several sound/visual peaks
were detected. In (Arsenio and Fitzpatrick, 2003), we describe a more
sophisticated binding method that can differentiate causally
unconnected signals with periods that are similar just by coincidence,
by looking for a drift in the phase between the acoustic and visual
signal over time, but such nuances are less important in a benign
developmental scenario supported by a caregiver.
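
The period test and peak association just described can be sketched as follows. This is an illustrative reading, not the original code: times are in seconds, the function names are hypothetical, and the half-period case is interpreted as two sounds per visual cycle (acoustic period equal to half the visual period), which is how the text motivates it.

```python
def periods_match(visual_period, acoustic_period, tolerance=0.060):
    """True if the acoustic period equals the visual period, or half of
    it, to within the tolerance (60 ms by default)."""
    return (abs(visual_period - acoustic_period) <= tolerance
            or abs(visual_period / 2 - acoustic_period) <= tolerance)


def match_peaks(visual_peaks, acoustic_peaks, visual_period):
    """Pair each visual peak with the earliest acoustic peak that follows
    it within half a visual period (a positive phase lag)."""
    pairs = []
    for v in visual_peaks:
        candidates = [a for a in acoustic_peaks
                      if 0 <= a - v <= visual_period / 2]
        if candidates:
            pairs.append((v, min(candidates)))
    return pairs


# A visual cycle of 1 s with a sound every 0.5 s, each lagging by 0.1 s.
visual_peaks = [0.0, 1.0, 2.0]
acoustic_peaks = [0.1, 0.6, 1.1, 1.6, 2.1]
print(periods_match(1.0, 0.5))
print(match_peaks(visual_peaks, acoustic_peaks, visual_period=1.0))
```

In this example every visual peak finds a partner, so there are several redundant matches within the window, as the text anticipates.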

Figure 5 shows an experiment in which a person shook a tambourine in
front of the robot for a while. The robot detected the periodic motion
of the tambourine, the rhythmic rise and fall of the jangling bells,
and bound the two signals together in real-time.


5. Learning about people

In this section we do not wish to present any new algorithms, but
rather show that the cross-modal binding method we developed for object
perception also applies to perceiving people. Humans often use body
motion and repetition to reinforce their actions and speech, especially
with young infants. If we do the same in our interactions with Cog,
then it can use those cues to link visual input with corresponding
sounds. For example, Figure 6 shows a person shaking their head while
saying “no! no! no!” in time to their head motion. The figure shows
that the robot extracts a good segmentation of the shaking head, and
links it with the sound signal. Such actions appear to be understood by
human infants at around 10-12 months (American Academy Of Pediatrics,
1998).

Sometimes a person’s motion causes sound, just as an ordinary object’s
motion might. Figure 7 shows a person jumping up and down in front of
Cog. Every time he lands on the floor, there is a loud bang, whose
periodicity matches that of the tracked visual motion. We expect that
there are many situations like this that the robot can extract
information from, despite the fact that those situations were not
considered during the design of the binding algorithms. The images in
all these figures are taken from online experiments - no offline
processing is done.

Figure 7: Once the cross-modal binding system was in place, the authors
started to have fun. This figure shows the result of one author jumping
up and down like crazy in front of the robot. The thud as he hit the
floor was correctly bound with segmentations of his body (column on
right). The bottom row shows segmentations from a similarly successful
experiment where the other author started applauding the robot.

6. Learning about the self

So far we have considered only external events that do not involve the
robot. In this section we turn to the robot’s perception of its own
body. Cog treats proprioceptive feedback from its joints as just
another sensory modality in which periodic events may occur. These
events can be bound to the visual appearance of its moving body part -
assuming it is visible - and the sound that the part makes, if any (in
fact Cog’s arms are quite noisy, making an audible “whirr-whirr” when
they move back and forth).

Figure 8 shows a basic binding experiment, in which a person moves
Cog’s arm while it is out of the robot’s view. The sound of the arm and
the robot’s proprioceptive sense of the arm moving are bound together.
This is an important step, since in the busy lab Cog inhabits, people
walk into view all the time, and there are frequent loud noises from
the neighboring machine shop. So cross-modal rhythm is an important cue
for filtering out extraneous noise and events of lesser interest.

[Figure 8 panel label: robot is looking away from its arm as human
moves it]

Figure 8: In this experiment, a person grabs Cog’s arm and shakes it
back and forth while the robot is looking away. The sound of the arm is
detected, and found to be causally related to the proprioceptive
feedback from the moving joints, and so the robot’s internal sense of
its arm moving is bound to the external sound of that motion.

In Figure 9, the situation is similar, with a person moving the robot’s
arm, but the robot is now looking at the arm. In this case we see our
first example of a binding that spans three modalities: sight, hearing,
and proprioception. The same is true in Figure 10, where Cog shakes its
own arm while watching it in a mirror. This idea is related to work in
(Metta and Fitzpatrick, 2003), where Cog located its arm by shaking it.

An important milestone in child development is reached when the child
recognizes itself as an individual, and identifies its mirror image as
belonging to itself (Rochat and Striano, 2002). Self-recognition in a
mirror is also the focus of extensive study in biology. Work on
self-recognition in mirrors for chimpanzees (Gallup et al., 2002)
suggests that animals other than humans can also achieve such
competency, although the interpretation of such results requires care
and remains controversial. Self-recognition is related to the notion of
a theory-of-mind, where intents are assigned to other actors, perhaps
by mapping them onto oneself, a topic of great interest in robotics
(Kozima and Yano, 2001, Scassellati, 2001). Proprioceptive feedback
provides very useful reference signals to identify appearances of the
robot’s body in different modalities. That is why we extended our
binding algorithm to include proprioceptive data.

[Figure 9 panel label: robot is looking towards its arm as human moves
it]

Figure 9: In this experiment, a person shakes Cog’s arm in front of its
face. What the robot hears and sees has the same rhythm as its own
motion, so the robot’s internal sense of its arm moving is bound to the
sound of that motion and the appearance of the arm.

Children between 12 and 18 months of age become interested in and
attracted to their reflection (American Academy Of Pediatrics, 1998).
Such behavior requires the integration of visual cues from the mirror
with proprioceptive cues from the child’s body. As shown in Figure 11,
the binding algorithm was used not only to identify the robot’s own
acoustic rhythms, but also to identify visually the robot’s mirror
image (an important milestone in the development of a child’s theory of
mind (Baron-Cohen, 1995)). It is important to stress that we are
dealing with the low-level perceptual challenges of a theory of mind
approach, rather than the high-level inferences and mappings involved.
Correlations of the kind we are making available could form a grounding
for a theory of mind and body-mapping, but are not of themselves part
of a theory of mind - for example, they are completely unrelated to the
intent of the robot or the people around it, and intent is key to
understanding others in terms of the self (Kozima and Zlatev, 2000,
Kozima and Yano, 2001). Our hope is that the perceptual and cognitive
research will ultimately merge and give a truly intentional robot that
understands others in terms of its own goals and body image - an image
which could develop incrementally using cross-modal correlations of the
kind explored in this paper.


[Figure 10 panel labels: robot moves its arm while looking in a mirror;
arm segmentations; appearance, sound, and action of the arm all bound
together]

Figure 10: In this experiment, Cog is looking at itself in a mirror,
while shaking its arm back and forth (the views on the right are taken
by a camera behind the robot’s left shoulder, looking out with the
robot towards the mirror). The reflected image of its arm is bound to
the robot’s sense of its own motion, and the sound of the motion. This
binding is identical in kind to the binding that occurs if the robot
sees and hears its own arm moving directly without a mirror. However,
the appearance of the arm is from a quite different perspective than
Cog’s own view of its arm.


7.  Discussion  and  conclusions 

Most of us have had the experience of feeling a tool become an
extension of ourselves as we use it (see (Stoytchev, 2003) for a
literature review). Many of us have played with mirror-based games that
distort or invert our view of our own arm, and found that we stop
thinking of our own arm and quickly adopt the new distorted arm as our
own. About the only form of distortion that can break this sense of
ownership is a delay between our movement and the proxy-arm’s movement.
Such experiences argue for a sense of self that is very robust to every
kind of transformation except latencies. Our work is an effort to build
a perceptual system which, from the ground up, focuses on timing just
as much as content. This is powerful because timing is truly
cross-modal, and leaves its mark on all the robot’s senses, no matter
how they are processed and transformed.

We are motivated by evidence from human perception that strongly
suggests that timing information can transfer between the senses in
profound ways. For example, experiments show that if a short fragment
of white noise is recorded and played repeatedly, a listener will be
able to hear its periodicity. But as the fragment is made longer, at
some point this ability is lost. However, the repetition can be heard
for far longer fragments if a light is flashed in synchrony with it
(Bashford et al., 1993) - flashing the light actually changes how the
noise sounds. More generally, there is evidence that the cues used to
detect periodicity can be quite subtle and adaptive (Kaernbach, 1993),
suggesting there is a lot of potential for progress in replicating this
ability beyond the ideas already described.

Figure 11: Cog can be shown different parts of its body simply by
letting it see that part (in a mirror if necessary) and then shaking
it, such as its (right) hand or (left) flipper. Notice that this works
for the head, even though shaking the head also affects the cameras.

Figure 12: This figure shows a real-time view of the robot’s status
during the experiment in Figure 9. The robot is continually collecting
visual and auditory segmentations, and checking for cross-modal events.
It also compares the current view with its database and performs object
recognition to correlate with past experience (bottom right).

Although there is much to do, from a practical perspective a lot has
already been accomplished. Consider Figure 12, which shows a partial
snapshot of the robot’s state during one of the experiments described
in the paper. The robot’s experience of an event is rich, with many
visual and acoustic segmentations generated as the event continues,
relevant prior segmentations recalled using object recognition, and the
relationship between data from different senses detected and stored. We
believe that this kind of experience will form one important part of a
perceptual toolbox for autonomous development, where many very good
ideas have been hampered by the difficulty of robust perception.

Another ongoing line of research we are pursuing is truly cross-modal
object recognition. A hammer causes sound after striking an object. A
toy truck causes sound while moving rapidly with wheels spinning; it is
quiet when changing direction - therefore, the truck’s acoustic
frequency is twice the frequency of its visual trajectory. A bell
typically causes sound at either extreme of motion. All these
statements are truly cross-modal in nature, and with our system we can
begin to use such properties for recognition.
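
To make the idea of such relational features concrete, here is a small hypothetical sketch (not part of the system described in this paper) that derives two of them from bound signals: the number of sounds per visual cycle, and the phase of each sound within the visual cycle. Function names and the choice of the first visual peak as phase origin are assumptions.

```python
def sounds_per_cycle(visual_period, acoustic_period):
    """Nearest whole number of sounds produced per visual cycle."""
    return round(visual_period / acoustic_period)


def sound_phases(acoustic_peaks, visual_peaks, visual_period):
    """Phase (0 to 1) of each sound within the visual cycle, measured
    from the first visual peak."""
    origin = visual_peaks[0]
    return [((a - origin) % visual_period) / visual_period
            for a in acoustic_peaks]


# One sound per cycle, at a trajectory extreme (hammer-like pattern).
print(sounds_per_cycle(1.0, 1.0),
      sound_phases([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], 1.0))
# Two sounds per cycle, half a cycle apart (bell-like pattern).
print(sounds_per_cycle(1.0, 0.5),
      sound_phases([0.0, 0.5, 1.0, 1.5], [0.0, 1.0], 1.0))
```

Features of this kind are only available once the visual and acoustic signals have been bound, since they describe the relationship between the two rather than either signal alone.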